Phonetic speech-to-text-to-speech system and method

ABSTRACT

A speech-to-text-to-speech system for use with on-line and real-time transmission of speech with a small bandwidth from a source to a destination. Speech is received and broken down into phonemes, which are encoded into a series of symbols compatible with communication systems and other than a known symbolic representation of the speech in a known language, for transmission through communication networks. When received, the series of symbols is decoded to restore the phonemes and to reconstitute speech according to the phonemes prior to being communicated to a listening party.

Field of the Invention

[0001] The present invention relates to speech to text systems, and more specifically to speech to text phonetic systems for use with on-line and real time communication systems.

BACKGROUND OF THE INVENTION

[0002] A typical way of transforming speech into text is to create and dictate a document, which is then temporarily recorded by a recording apparatus such as a tape recorder. A secretary, a typist, or the like reproduces the dictated contents using a documentation apparatus such as a typewriter, word processor, or the like.

[0003] Along with a recent breakthrough in speech recognition technology and improvement in performance of personal computers, a technology for documenting voice input through a microphone connected to a personal computer by recognizing speech within application software running in the personal computer, and displaying the document has been developed. However, it is difficult for a speech recognition system to carry out practical processing within an existing computer, especially a personal computer because the data size of language models becomes enormous.

[0004] Inconveniently, such an approach necessitates either training of a computer to respond to a single user having a voice profile that is distinguished through training or a very small recognisable vocabulary. For example, trained systems are excellent for voice speech recognition applications but they fail when another user dictates or when the correct user has a cold or a sore throat. Further, the process takes time and occupies a large amount of disk space since it relies on dictionaries of words and spell and grammar checking to form accurate sentences from dictated speech.

[0005] Approaches to speech synthesis rely on text provided in the form of recognisable words. These words are then converted into known pronunciation either through rule application or through a dictionary of pronunciation. For example, one approach to human speech synthesis is known as concatenative. Concatenative synthesis of human speech is based on recording waveform data samples of real human speech of predetermined text. Concatenative speech synthesis then breaks down the pre-recorded original human speech into segments and generates speech utterances by linking these human speech segments to build syllables, words, or phrases. Various approaches to segmenting the recorded original human voice have been used in concatenative speech synthesis. One approach is to break the real human voice down into basic units of contrastive sound. These basic units of contrastive sound are commonly known as phones or phonemes.

[0006] Because of the way speech to text and text to speech systems are designed, they function adequately with each other and with text-based processes. Unfortunately, such a design renders both systems cumbersome and overly complex. A simpler speech-to-text and text-to-speech implementation would be highly advantageous.

[0007] It would be advantageous to provide a system that requires reduced bandwidth to support voice communication.

OBJECT OF THE INVENTION

[0008] Therefore, it is an object of the present invention to provide a system that allows for on-line transmission of speech with a small bandwidth from a source to a destination.

SUMMARY OF THE INVENTION

[0009] In accordance with a preferred embodiment of the present invention, there is provided a speech-to-text-to-speech system comprising:

[0010] a first input port for receiving a speech;

[0011] a first processor in communication with the first input port, the first processor for identifying phonemes within the received speech and for encoding the phonemes into series of symbols compatible with communication systems, the series of symbols other than a known symbolic representation of the speech in a known language;

[0012] a first output port in connection with the first processor, the first output port for transmitting the series of symbols;

[0013] a second input port for receiving the series of symbols;

[0014] a second processor in communication with the second input port, the second processor for decoding the series of symbols to restore the phonemes and for reconstituting speech according to the phonemes; and,

[0015] a second output port for providing a signal indicative of the reconstituted speech,

[0016] wherein the reconstituted speech is similar to the received speech.

[0017] In accordance with another preferred embodiment of the present invention, there is provided a method of transmitting a speech on-line comprising the steps of:

[0018] providing speech;

[0019] identifying phonemes within the received speech;

[0020] encoding the phonemes into series of symbols compatible with a communication system, the series of symbols other than a known symbolic representation of the speech in a known language;

[0021] transmitting the series of symbols via a communication medium;

[0022] receiving the series of symbols;

[0023] decoding the series of symbols to provide a signal representative of the speech and including data reflective of the phonemes reconstituted to form reconstituted speech similar to the received speech.

BRIEF DESCRIPTION OF THE DRAWINGS

[0024] Exemplary embodiments of the invention will now be described in conjunction with the following drawings, in which:

[0025] FIG. 1 is a block diagram of a prior art text-to-speech system;

[0026] FIG. 2 is a schematic representation of a speech-to-text-to-speech system according to the invention;

[0027] FIG. 3a shows the first part of the speech-to-text-to-speech system, i.e. the speech-to-text portion according to a preferred embodiment of the present invention;

[0028] FIG. 3b shows the second part of the speech-to-text-to-speech system, i.e. the text-to-speech portion according to the preferred embodiment of the present invention;

[0029] FIG. 4a shows the first part of the speech-to-text-to-speech system, i.e. the speech-to-text portion according to another preferred embodiment of the present invention;

[0030] FIG. 4b shows the second part of the speech-to-text-to-speech system, i.e. the text-to-speech portion according to the other preferred embodiment of the present invention; and,

[0031] FIG. 5 is a flow chart diagram of a method of on-line and real-time communicating.

DETAILED DESCRIPTION OF THE INVENTION

[0032] Referring to FIG. 1, a block diagram of a prior art text-to-speech system 10 is shown. Text 12 is provided via an input device in the form of a computer having a word processor, a printer, a keyboard and so forth, to a processor 14 such that the text is analysed using for example a dictionary to translate the text using a phonetic translation. The processor 14 specifies the correct pronunciation of the incoming text by converting it into a sequence of phonemes. A pre-processing of the symbols, numbers, abbreviations, etc., is performed such that the text is first normalized and then converted to its phonetic representation by applying for example a lexicon table look-up. Alternatively, morphological analysis, letter-to-sound rules, etc. are used to convert the text to speech.

[0033] The phoneme sequence derived from the original text is transmitted to acoustic processor 16 to convert the phoneme sequence into various synthesizer controls, which specify the acoustic parameters of corresponding output speech. Optionally, the acoustic processor calculates controls for parameters such as prosody (i.e. pitch contours and phoneme duration), voicing source (e.g. voiced or noise), transitional segmentation (e.g. formants, amplitude envelopes) and/or voice colour (e.g. timbre variations).

[0034] A speech synthesizer 18 receives as input the control parameters from the acoustic processor. The speech synthesizer converts the control parameters of the phoneme sequence derived from the original text into output waveforms representative of the corresponding spoken text. A loudspeaker 19 receives as input the output waveforms from the speech synthesizer 18 and outputs the resulting synthesized speech of the text.

[0035] Referring to FIG. 2, a schematic representation of a speech-to-text-to-speech system is shown. Speech is provided to a processor 20 through an input device in the form, for example, of a microphone 21. The processor 20 breaks the spoken words down into phonemes and translates the provided speech into a text that corresponds to the speech in a phonetic form. The phonetic text is transmitted via a telecommunication network such as the Internet or a public telephony switching system and is received by device 21 including a processor for restoring the speech according to the phonemes received. The restored speech is then provided to an output port 22 in the form for example of a loudspeaker.

[0036] Of course, the speech is provided either in a direct way, i.e. an individual speaks through a microphone connected to the speech-to-text-to-speech system, or using a device such as a tape on which the speech was previously recorded and from which it is read.

[0037] The speech-to-text-to-speech system, according to an embodiment of the present invention is detailed in FIGS. 3a and 3b, wherein FIG. 3a shows the first part of the speech-to-text-to-speech system, i.e. the speech-to-text portion, whereas FIG. 3b shows the second part of the speech-to-text-to-speech system, i.e. the text-to-speech portion.

[0038] Referring to FIG. 3a, the speech to text portion of the system is in the form of a device including an input port 31 for receiving a speech to transform, the input port 31 for connecting with the microphone 21 for example. The input port 31 is in communication with a translator 32. The purpose of the translator is to identify the phonemes when a part of a speech is received at input port 31, and to provide the identified phonemes to a text generator 33. The text generator is in the form for example of a phonetic word processor, which transcribes the phonemes into corresponding series of written characters such that a phonetic transcript of the original speech is generated. The phonetic transcript is modified at 41 in order to render it compatible with a telecommunication transmission system, and is communicated to the output port 34 and transmitted out. The modification includes encoding the phonetic characters, the encoding resulting in a series of symbols such as, for example, alphanumeric equivalents or ASCII codes according to look-up tables. The encoding preferably results in a symbolic representation of the speech other than in a known human intelligible language.

[0039] Optionally, the table is transmitted along with the message such that, upon reception of the message, the decoding of the message is performed using the same look-up table, which enhances the similarities between the generated speech and the provided speech.
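
The look-up-table encoding of paragraphs [0038] and [0039] can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation of the invention: the phoneme labels, the symbol assignments in `LOOKUP_TABLE`, and the function names `encode_phonemes` and `decode_symbols` are all hypothetical.

```python
# Hypothetical look-up table: phoneme label -> one transmittable ASCII symbol.
# Labels and symbol assignments are illustrative assumptions only.
LOOKUP_TABLE = {
    "h": "A", "eh": "B", "l": "C", "ow": "D",
    "dh": "E", "ah": "F", "s": "G", "n": "H",
}


def encode_phonemes(phonemes, table):
    """Map each identified phoneme to its symbol and join into one string."""
    return "".join(table[p] for p in phonemes)


def decode_symbols(symbols, table):
    """Invert the same table to restore the phoneme sequence."""
    reverse = {symbol: phoneme for phoneme, symbol in table.items()}
    return [reverse[s] for s in symbols]


phonemes = ["h", "eh", "l", "ow"]
encoded = encode_phonemes(phonemes, LOOKUP_TABLE)   # symbols sent on the wire
restored = decode_symbols(encoded, LOOKUP_TABLE)    # phonemes at the receiver
```

Because the decoder simply inverts the encoder's table, both ends must hold the same table, which is consistent with the option of transmitting the table along with the message.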

[0040] For example, in operation, the sentence: “hello, the sun is shining” is a sentence provided as speech; it is received at the input port and processed by the speech-to-text portion of the system. The translator identifies the following phonemes:

[0041]

which does not indicate the punctuation nor the word partitions. Of course, the phonemes may include silent phonemes to indicate pauses such as are common between words or at the end of sentences.

[0042] This resulting series of phonetic characters corresponds to the original sentence. Advantageously, such phonetic language already incorporates indication regarding a way of speaking of the individual providing the speech such as an accent, a tempo, a rhythm, etc. Some phonetic characters are different when a phoneme is spoken with a tonic accent.

[0043] Optionally, the system also includes a sound analyzer 35 for providing a value indicative of vocal parameters as for example high/low, fast/slow, or tonic/not tonic when a phonetic character does not already exist.

[0044] An example of added values is a “1” associated with a phoneme when it is in a high pitch and a “0” for lower frequency relative to a preset medium frequency. A further associated “1” indicates a fast and a “0” a slow pronunciation speed relative to a preset medium speed.

[0045] Of course, it is possible that each of these parameters is associated with the phoneme. Alternatively, only one or a combination of parameters is associated with a phoneme.
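
The pitch and speed flags of paragraph [0044] can be sketched as a pair of binary values attached to a phoneme. The preset medium values, the units, and the names `vocal_flags` and `annotate` below are assumptions chosen for illustration; the specification does not fix them.

```python
# Assumed preset medium levels; the specification gives no numeric values.
MEDIUM_PITCH_HZ = 160.0   # hypothetical preset medium frequency
MEDIUM_RATE = 12.0        # hypothetical preset medium speed (phonemes/second)


def vocal_flags(pitch_hz, rate):
    """Return two flag characters: pitch flag first, then speed flag.

    "1" marks a high pitch or a fast pronunciation, "0" marks a lower pitch
    or a slow pronunciation, each relative to its preset medium level.
    """
    pitch_bit = "1" if pitch_hz > MEDIUM_PITCH_HZ else "0"
    speed_bit = "1" if rate > MEDIUM_RATE else "0"
    return pitch_bit + speed_bit


def annotate(phoneme, pitch_hz, rate):
    """Associate the flags with a phoneme, as a (phoneme, flags) pair."""
    return (phoneme, vocal_flags(pitch_hz, rate))


high_and_slow = annotate("l", 220.0, 8.0)
```

A system that tracks only one parameter would, under the same assumptions, keep just one of the two flag characters.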

[0046] Referring back to the exemplary sentence, suppose it is pronounced such that the word “hello” is accentuated on the first syllable, and the second one is accentuated and long or slowly pronounced. The characterization of the word incorporates the vocal flexibility, and the resulting translated word according to the encoding example is: h

'l

1 10

[0047] Of course, many other ways of pronouncing the word “hello” exist, as for example skipping the beginning “h”, or transforming the syllable “he” to sound more like “hu”. Regardless of the pronunciation style, the translator transforms the signal that is received without attempting to identify the word or to provide a “restored” phonetic translation.

[0048] Referring to FIG. 3b, the encoded transmitted text is received at input port 36 of the text-to-speech portion of the speech-to-text-to-speech system. The encoded text is transmitted to the speech generator 38 for decoding the text using a look-up table for reconstituting a speech based on the phonetic characters to an output port 40. The output port is in the form for example of a loudspeaker or a headphone when a listening party prefers to listen to the speech in a more private environment.

[0049] Referring back to our example: The sentence originally spoken is: “hello, the sun is shining”.

[0050] The resulting phonetic transform,

is encoded and the series of symbols corresponding to the phonemes are transmitted through the output port 34 and received at the input port 36.

[0051] The following phrase: “hellothesunisshinning” is reconstituted by the speech generator by performing a reverse operation, i.e. decoding the series of symbols for restoring the phonemes and for delivering the message at the output port 40 to a listening party. Reading such a text aloud results in recovering the original broken down speech.

[0052] Optionally, the system includes a vocalizer 39, which is in communication with the speech generator and integrates vocal parameters if any were associated with the phonetic characters and provides to the output port sounds reflecting voice inflexion of the individual having spoken the original speech.

[0053] Of course, the breakdown of a speech into symbols corresponding to known phonemes for direct transcription into a text typically renders the text unintelligible. In fact such a transcript would look like series of symbols, each symbol corresponding to a phoneme.

[0054] Advantageously, in such a system, the text is a transitory step of the process and is preferably not used for editing or publishing purpose for example. Therefore, there is no need of performing an exact transcription of the speech; there is no need of specific application software for comparing a word with a dictionary, for determining grammatical rules and so forth. Consequently, each phoneme is represented with a few bits, which favors the speed of transmission of the text.
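
To make the "few bits" remark of paragraph [0054] concrete, the following sketch computes the fixed-width code size needed for a given phoneme inventory and the resulting symbol bit rate. The inventory size of 44 and the speaking rate of 12 phonemes per second are assumed figures for illustration, not values from the specification, and the function names are hypothetical.

```python
def bits_per_phoneme(inventory_size):
    """Smallest fixed code width (in bits) that can index every phoneme."""
    if inventory_size < 1:
        raise ValueError("inventory_size must be at least 1")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 2, using exact
    # integer arithmetic; a one-entry inventory still needs one bit here.
    return max(1, (inventory_size - 1).bit_length())


def symbol_bit_rate(inventory_size, phonemes_per_second):
    """Bits per second needed to carry the bare phoneme symbols."""
    return bits_per_phoneme(inventory_size) * phonemes_per_second


width = bits_per_phoneme(44)        # assumed inventory of 44 phonemes
rate = symbol_bit_rate(44, 12)      # assumed 12 phonemes per second
```

Under these assumptions the symbol stream needs 6 bits per phoneme and 72 bits per second, before any vocal-parameter flags or framing overhead are added.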

[0055] An international phonetic alphabet exists, which is preferably used with such a speech-to-text-to-speech system for unifying the system such that the text generator and the speech generator are compatible one with the other.

[0056] The speech-to-text-to-speech system, according to another embodiment of the present invention is detailed in FIGS. 4a and 4b, wherein FIG. 4a shows the first part of the speech-to-text-to-speech system, i.e. the speech-to-text portion, whereas FIG. 4b shows the second part of the speech-to-text-to-speech system, i.e. the text-to-speech portion.

[0057] Referring to FIG. 4a, the speech to text portion of the system is in the form of a device including an input port 42 for receiving speech to transform, the input port 42 for connecting with the microphone 21 for example. The input port 42 is in communication with a language selector 43 and optionally with a speaker identifier 44. The purpose of the language selector 43 is to identify a language of the speech prior to a communication session. Language identification is provided at the beginning of the transmitted phonetic text in the form of ENG for English, FRA for French, ITA for Italian and so forth. Upon identifying a language, a phoneme database is selected from the database 45 such that only the phonemes corresponding to the identified language are used to translate the speech into a phonetic text by the text generator 46 in the form for example of a phonetic processor, which transcribes the phonemes into corresponding series of written characters such that a phonetic transcript of the original speech is generated. When the system comprises a speaker identifier 44, the speaker provides his name or any indication of his identity, which is then associated with the generated text before being transmitted. The phonetic transcript is modified at 47 in order to render it compatible with a telecommunication transmission system, and is communicated to the output port 48 and transmitted out. As will be apparent to those of skill in the art, language dependent phonetic dictionaries allow for improved compression of the speech and for improved phoneme extraction. Alternatively, language and regional characteristics are used to select a phonetic dictionary such as Sco for Scottish English and Iri for Irish English in order to improve the phonetic dictionary for a particular accent or mode of speaking. Further alternatively, a speaker dependent phonetic dictionary is employed. Of course, it is preferable that the same dictionary is available at the receiving end for use in regenerating the speech in an approximately accurate fashion.

[0058] FIG. 4b is a block diagram of the second portion of the speech-to-text-to-speech system when a language is identified prior to a communication session. The transmitted phonetic text is received at input port 49 of the text-to-speech portion of the speech-to-text-to-speech system. The language of the phonetic text is identified by the language identifier 50, which allows selecting the phonemes corresponding to the identified language from a phonetic database 51. The speech generator 52 provides reconstituted speech based on the phonetic characters to an output port 55.

[0059] Advantageously, the identification of the language and the concomitant selection of the phonemes from the phonetic database improves a quality of the translation of the speech to a phonetic text. Similarly, upon receiving a phonetic text, an indication of the original language increases the quality of the restored speech.
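
The language identification placed at the beginning of the transmitted phonetic text, and its use at the receiving end to select the matching phoneme table, can be sketched as below. The three-letter prefix follows the ENG/FRA form given in the description; the table contents and the names `PHONEME_TABLES`, `build_message` and `read_message` are hypothetical.

```python
# Hypothetical per-language look-up tables (phoneme label -> symbol).
# The same symbol may stand for different phonemes in different languages.
PHONEME_TABLES = {
    "ENG": {"h": "A", "l": "B", "ow": "C"},
    "FRA": {"b": "A", "on": "B", "zh": "C"},
}


def build_message(language, phonemes):
    """Prefix the encoded phonetic text with its three-letter language tag."""
    table = PHONEME_TABLES[language]
    return language + "".join(table[p] for p in phonemes)


def read_message(message):
    """Read the language tag, select that language's table, and decode."""
    language, body = message[:3], message[3:]
    reverse = {symbol: phoneme
               for phoneme, symbol in PHONEME_TABLES[language].items()}
    return language, [reverse[s] for s in body]


message = build_message("ENG", ["h", "l", "ow"])
language, phonemes = read_message(message)
```

Restricting each table to one language keeps the symbol set small, which is one way to read the remark that language dependent dictionaries improve compression.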

[0060] Optionally, the system includes a vocalizer 53, which is in communication with the speech generator and integrates vocal parameters that are associated with the phonetic characters and provides to the output port sounds reflecting voice inflexion of the individual having spoken the original speech. When a speaker independent or language and region independent dictionary is used, the dictionary preferably includes vocal parameters to characterize such as tone, pitch, speed, guttural quality, whisper, etc.

[0061] Further optionally, the system comprises a memory where vocal characteristics of various people are stored. This is advantageous when the speaker and the listener know each other and each has a profile corresponding to the other stored in the memory. A profile associated with an individual comprises the individual's voice inflections, pitch, voice quality, and so forth. Upon receiving a phonetic text having an identification of the speaker, the individual profile corresponding to the speaker is extracted and combined with the vocal parameters associated with the received text. Thus, the reconstituted speech is declaimed using the speaker's vocal characteristics instead of a standard computerized voice.
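
A minimal sketch of the profile retrieval of paragraph [0061] follows. It assumes that a profile is a small set of named settings, that an unknown speaker falls back to a standard voice, and that parameters received with the text override the stored profile; none of these choices, nor the names used, come from the specification.

```python
# Hypothetical stored profiles keyed by speaker identification.
DEFAULT_VOICE = {"pitch_hz": 150.0, "rate": 1.0}
PROFILES = {
    "alice": {"pitch_hz": 210.0, "rate": 1.1},
}


def synthesis_settings(speaker_id, received_params):
    """Combine the stored profile with parameters received with the text.

    Assumption: received parameters take precedence over the stored profile,
    and an unknown speaker is rendered with the standard voice.
    """
    settings = dict(PROFILES.get(speaker_id, DEFAULT_VOICE))
    settings.update(received_params)
    return settings


known = synthesis_settings("alice", {"rate": 0.8})
unknown = synthesis_settings("bob", {})
```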

[0062] Of course, once a dictionary is present and stored within a translating system, it is optional to have that system characterize speech received to identify the language/region/speaker in an automated fashion. Such speaker recognition is known in the art.

[0063] Referring to FIG. 5, a flow chart diagram of a method of using the speech-to-text-to-speech system is shown. An individual provides a speech to the system that breaks down the speech as it is provided to identify the phonemes in order to generate a text in a phonetic format. In a further step, the phonetic text is sent through an existing communication system to a computer system remotely located. The transmission of the phonetic text is a fast process especially using Internet connections, and the texts so sent are usually small files. Upon receiving the phonetic text, the speech generator on the remote computer system reconstitutes a speech based upon the phonetic system used.

[0064] Optionally, upon reaching a predetermined length of phonetic text, the phonetic text is transmitted such that the predetermined length of phonetic text is processed by the speech-to-text portion of the system to reduce the delay during a conversation.
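
The transmission of phonetic text in pieces of a predetermined length can be sketched with a simple buffering generator. The name `chunk_symbols`, the chunk length of 3, and the choice to flush a shorter final piece at the end of the speech are assumptions made for illustration.

```python
def chunk_symbols(symbol_stream, chunk_length):
    """Yield the phonetic text in pieces of a predetermined length.

    Symbols are buffered as they arrive; each time the buffer reaches
    chunk_length it is emitted for transmission, so the receiver can start
    reconstituting speech before the whole utterance has been spoken.
    """
    if chunk_length < 1:
        raise ValueError("chunk_length must be at least 1")
    buffer = []
    for symbol in symbol_stream:
        buffer.append(symbol)
        if len(buffer) == chunk_length:
            yield "".join(buffer)
            buffer = []
    if buffer:
        # Assumed behaviour: flush whatever remains when the speech ends.
        yield "".join(buffer)


chunks = list(chunk_symbols("ABCDEFG", 3))
```

A shorter chunk length lowers the delay before playback can begin, at the cost of more transmissions per utterance.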

[0065] In some languages, as for example, Chinese, the same words have different meanings depending on their pronunciation. In these languages, the pronunciation is a limiting parameter. As is apparent to a person with skill in the art, the system is implementable such that the pronunciation of the phonemes reflects the meaning of words.

[0066] Numerous other embodiments may be envisaged without departing from the spirit or scope of the invention.

What is claimed is:
 1. A speech-to-text-to-speech system comprising: a first input port for receiving a speech; a first processor in communication with the first input port, the first processor for identifying phonemes within the received speech and for encoding the phonemes into series of symbols compatible with communication systems, the series of symbols other than a known symbolic representation of the speech in a known language; a first output port in connection with the first processor, the first output port for transmitting the series of symbols; a second input port for receiving the series of symbols; a second processor in communication with the second input port, the second processor for decoding the series of symbols to restore the phonemes and for reconstituting speech according to the phonemes; and, a second output port for providing a signal indicative of the reconstituted speech, wherein the reconstituted speech is similar to the received speech.
 2. A speech-to-text-to-speech system according to claim 1 comprising a transducer for communicating the reconstituted speech to a listening party.
 3. A speech-to-text-to-speech system according to claim 2, comprising a first memory for storing phonetic database therein, the phonetic database including all phonemes of each of a plurality of languages.
 4. A speech-to-text-to-speech system according to claim 3, comprising a second memory for storing at least a look-up table therein, the look-up table including symbols representative of the phonemes for encoding the phonemes into series of symbols compatible with communication systems and other than a known symbolic representation of the speech in a known language.
 5. A speech-to-text-to-speech system according to claim 2, wherein the second processor comprises a voice generator for generating a signal, the signal, when provided to a speaker for resulting in the reconstituted speech.
 6. A speech-to-text-to-speech system according to claim 5, comprising a third memory for storing voice profile of speakers for personalizing the reconstituted speech when generated.
 7. A speech-to-text-to-speech system according to claim 2, wherein the first processor comprises a sound analyzer for identifying at least one of a pitch, a speed and a tone with which a phoneme is spoken and for associating a value indicative of the pitch, speed and tone relative to a preset medium level.
 8. A speech-to-text-to-speech system according to claim 2, wherein the first output port and the second input port are network connections for coupling with a wide area network.
 9. A speech-to-text-to-speech system according to claim 8, wherein the second output port comprises a speaker.
 10. A method of transmitting a speech on-line comprising the steps of: providing speech; identifying phonemes within the received speech; encoding the phonemes into series of symbols compatible with a communication system, the series of symbols other than a known symbolic representation of the speech in a known language; transmitting the series of symbols via a communication medium; receiving the series of symbols; decoding the series of symbols to provide a signal representative of the speech and including data reflective of the phonemes reconstituted to form reconstituted speech similar to the received speech.
 11. A method according to claim 10 comprising the steps of: reconstituting a speech according to the signal; and, communicating the reconstituted speech to a listening party.
 12. The method according to claim 11, wherein the step of providing a speech comprises the step of speaking into a microphone.
 13. The method according to claim 11, wherein the step of communicating the reconstituted speech to a listening party comprises the step of providing the reconstituted speech to at least a speaker.
 14. The method according to claim 11, wherein the step of encoding the phonemes into series of symbols comprises the step of: identifying a language of the speech; selecting from a phonetic database a look-up table from a plurality of different look-up tables and associated with the identified language; providing with a symbolic representation of the identified phonemes in accordance with the selected look-up table.
 15. The method according to claim 14, wherein the step of decoding the series of symbols to restore the phonemes comprises the steps of: identifying the language of the provided speech; selecting from a phonetic database a look-up table from a plurality of different look-up tables and associated with the identified language; providing with a phonetic representation of the series of symbols in accordance with the selected look-up table.
 16. The method according to claim 11, wherein the step of transmitting the series of symbols comprises the step of attaching the selected look-up table to the series of symbols.
 17. The method according to claim 16, wherein the step of decoding the phonemes into series of symbols comprises the step of using the look-up table attached to the transmitted series of symbols.
 18. The method according to claim 11, wherein the step of identifying phonemes within the received speech comprises the steps of: characterizing at least one voice related parameter; and, encoding a value indicative of the at least a voice related parameter in association with one or more phonemes.
 19. The method according to claim 11, wherein the step of communicating the reconstituted speech to a listening party comprises the steps of: identifying a speaker; retrieving from a memory a voice profile of the speaker previously stored therein; and, reconstituting the speech using the retrieved voice profile.
 20. The method according to claim 11, wherein the communication medium includes the Internet.
 21. A speech-to-text-to-speech system comprising: means for providing speech; means for identifying phonemes within the received speech; means for encoding the phonemes into series of symbols compatible with a communication system, the series of symbols other than a known symbolic representation of the speech in a known language; means for transmitting the series of symbols via a communication medium; means for receiving the series of symbols; means for decoding the series of symbols to provide a signal representative of the speech and including data reflective of the phonemes reconstituted to form reconstituted speech similar to the received speech.